RDD Using Text File

The textFile() method reads a text file from HDFS, the local file system, or any Hadoop-supported file system. Spark Core provides the textFile() and wholeTextFiles() methods in the SparkContext class, which are used to read single or multiple text or CSV files into a single Spark RDD. These methods can also read all files in a directory, or only files matching a specific pattern.

textFile(): Reads single or multiple text or CSV files and returns a single RDD[String].

wholeTextFiles(): Reads single or multiple files and returns a single RDD[Tuple2[String, String]], where the first value (_1) in each tuple is the file name and the second value (_2) is the content of that file.
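As a sketch of the tuple structure wholeTextFiles() returns (the directory path below is a placeholder; point it at files that exist in your environment):

```scala
// Read every file under the directory; each element is a
// (fileName, fileContent) pair rather than one line per element.
val filesRdd = sc.wholeTextFiles("/FileStore/tables/")

filesRdd.collect.foreach { case (name, content) =>
  println(s"File: $name")
  println(s"First 100 chars: ${content.take(100)}")
}
```

This is useful when per-file structure matters (for example, one JSON or XML document per file), since textFile() would split each file into individual lines.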

Read text File from HDFS
val rdd = sc.textFile("/FileStore/tables/orders.txt")
rdd.collect.foreach(f=>{println(f)})

Read text File from Local file System
val rdd = sc.textFile("file:///home/hduser/Desktop/Data/data.txt")
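textFile() also accepts an optional minPartitions argument to control the minimum parallelism of the resulting RDD. A minimal sketch, assuming the same local file as above:

```scala
// Request at least 4 partitions when reading; Spark may create more
// depending on the input splits, but not fewer.
val rdd = sc.textFile("file:///home/hduser/Desktop/Data/data.txt", 4)
println(rdd.getNumPartitions)
```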
Read All Text Files from a Directory
Pass a directory path or a wildcard pattern to textFile() to read every matching file into a single RDD[String]. If instead you want the entire contents of each file as a single record, use the wholeTextFiles() method on SparkContext.

val rdd = sc.textFile("/FileStore/tables/*")
rdd.collect.foreach(f=>{println(f)})


RDD From File
val scalaFile = scala.io.Source.fromFile("/data/retail_db/products/part-00000").getLines.toList
val scalaFileRDD = sc.parallelize(scalaFile)
 
Word Count Program
val rdd = sc.textFile("file:///home/hduser/Desktop/Data/data.txt")
val words = rdd.flatMap(x => x.split(" ")).map(x => (x,1))
val word_count = words.reduceByKey((x, y) => x + y)
word_count.collect()
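To view the most frequent words first, the counts can be sorted by value before collecting. A minimal sketch continuing from word_count above:

```scala
// Sort the (word, count) pairs by count in descending order
// and print the top 10 words.
val topWords = word_count.sortBy({ case (_, count) => count }, ascending = false)
topWords.take(10).foreach { case (word, count) => println(s"$word: $count") }
```

Note that sortBy triggers a shuffle, so on large data it is cheaper to use take() on the sorted RDD than to collect() everything to the driver.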
